In today's post, we will explain a certain algorithm for matrix factorization models for recommender systems which goes by the name Alternating Least Squares (there are others, for example based on stochastic gradient descent). We will go through the basic ALS algorithm, as well as how one can modify it to incorporate user and item biases.
We will also go through the ALS algorithm for implicit feedback, and then explain how to modify that to incorporate user and item biases. The basic ALS model and the version from implicit feedback are discussed in many places (both online and in freely available research papers), but we aren't aware of any good source for implicit ALS with biases... hence, this post.
Recommendation engines are becoming ubiquitous in the modern consumer landscape. From Netflix's personalized movie recommendations to Amazon's raft of suggested items (New For Your, Recommendations for You in the Kindle Store, etc...), to Spotify's wildly popular personalized Discover Weekly Playlist, recommendation engines have become an essential tool for promoting content discovery and extending consumer engagement.
There has been a proliferation of algorithms in the data science community upon which recommender systems are built, but they all follow the basic central dogma that similar users like similar items. The hard part, of course, is deciding exactly what "similar" means and how to measure it. For Netflix, similar users are users who rate movies similarly. For Spotify, similar users are users who listen to the same kinds of music (a little thought will convince you that these two ideas, while similar, are not really the same).
The simplest type of recommender system, called a collaborative filter, is essentially a literal interpretation of "similar users like similar things" Goldberg1992. Once a metric for similarity is decided on, one just looks at the top rated items of the $k$ nearest neighbors (with respect to the similarity metric), filters out those items which the user of interest has already seen/consumed, and voila: personalized recommendations.
Collaborative filters have the advantages of being relatively easy to implement and relatively easy to interpret (e.g. we recommended movie A to user 1 because users 2, 3, and 4 all rated it highly and have similar ratings profiles to user 1). This being the Age of Data Science, though, it is no longer enough to have a reasonable explanation. We can actually measure how effectively a recommender system predicts the ratings of users on unseen items. Thus began Netflix's famous quest for better recommender systems, leading to the famous million dollar prize and eventual solution Bell2008 (by a team of experts working for several years).
Although the recommender system that won the Netflix prize was never put into production (it was a carefully tuned mix of many statistical models, and the marginal gain obtained by implementing all of the models was not worth the technical price compared to simply implementing a few of the best ones), the data science community, and the wider commercial world, learned without a doubt that, while collaborative filters are explainable and easily implementable, one can do significantly better if one is willing to flex a bit of mathematical muscle.
There were essentially two types of recommender systems in the final solution to the Netflix prize, and they have become the bread and butter of the current state of the art: matrix factorization models and restricted Bolzmann machines (RBMs). We will discuss matrix factorization models in this post.
We will be interested in two refinements of the basic matrix factorization model for recommendations: using implicit feedback, and using user and item biases.
It was realized early on, even for collaborative filters, that recommender systems work a lot better if one accounts for user and item biases.
Suppose, for example, that we are trying to predict movie ratings and
Clearly 3 stars from Bob is very different than 3 stars from Alice. If we keep track of the average movie rating for each user (the user bias), and just try to predict the difference from that average, the recommender system works much better.
Similarly, suppose
Then a 3 star rating is very different for Snakes on a Plane than for The Shawshank Redemption. If we keep track of the average user rating for each movie (the item bias), and just try to predict the difference from that average, the recommender system also works much better.
Obviously, one should keep track of both user and item biases. (We will explain the technical details below.)
One of the challenges of recommender systems in the wider commercial world is that one rarely has explicit ratings data (Netflix strongly encourages users to rate movies and leverage the tech built around those ratings).
Instead, user-item interactions like clicks, purchases, song listens (or fast-forwards), etc, are used as a proxy to indicate a preference or distaste for a particular item (possibly very weakly). If a user listens to a song only once, maybe it indicates that they didn't like the song. However, maybe they were busy or weren't paying attention and actually would like to hear the song again. If a user listens to a song 100 times, it is a safe bet that they like the song. But it is hard to draw a line between unknown preference and known preference.
This kind of indirect information about user-item preferences is known as implicit feedback, and many smart people have thought long and hard about how to deal with it. We won't delve much into the modeling of implicit feedback in today's post. It turns out to be relatively effective to model a user's preference on a scale of 0 (bad) to 1 (good) instead of modeling the user's raw number of interactions (or whatever number is actually being measured).
We can interpret the number of user-item interations (song listens, for example) as a measure of our confidence in our model's prediction for the user's preference of the item.
Below, we'll step through the details of how a matrix factorization model can be used to deal with implicit feedback. The basic algorithm we discuss is from the seminal paper Hu2008.
The most basic matrix factorization model for recommender systems models the rating $\hat{r}$ a user u
would give to an item i
by
where $x_u^T = (x_u^1, x_u^2, \dots, x_u^N)$ is a vector associated to the user, and $y_i^T = (y_i^1, y_i^2, \dots, y_i^N)$ is a vector associated to the item. The dimension of the vectors is the rank of the model, and the components are called factors.
We can collect the user-item ratings into a matrix $\widehat{R}=(\hat{r}_{ui})$. First collect the user vectors into a matrix
$$X^T = \left(\begin{matrix} \vdots & \vdots & \cdots & \vdots \\
x_{u_1} & x_{u_2} & \cdots & x_{u_{n_\text{users}}} \\
\vdots & \vdots & \cdots & \vdots \end{matrix}\right),$$
and the item vectors into a matrix
$$Y^T = \left(\begin{matrix} \vdots & \vdots & \cdots & \vdots \\
y_{i_1} & y_{i_2} & \cdots & y_{i_{n_\text{items}}} \\
\vdots & \vdots & \cdots & \vdots \end{matrix}\right),$$
then we can express the above model as
Of course, there is no reason that the "true" user-item matrix $R = (r_{ui})$ should have rank $N$, and indeed, we won't even know the "true" user-item matrix. This model assumes that it can be approximated by a rank $N$ factorization:
$$ R \sim \widehat{R} = XY^T.$$In other words, we want to find an $n_\text{users}\times N$ matrix $X$ and an $n_\text{items}\times N$ matrix $Y$ such that $\widehat{R}:=XY^T$ approximates $R$.
Remark We follow the matrix conventions of Hu2008, and machine learning in general. Although the user- and item-factor vectors $x_u$ and $y_i$ (and vectors in general) are regarded as column vectors, they are arranged as rows into the composite matrices $X$ and $Y$.
One popular method for finding the matrices $X$ and $Y$ given partial information about $R$ is known as alternating least squares. The idea is to find the parameters $x^j_{u}$ and $y^j_{i}$ (the entries of the matrices $X$ and $Y$) which minimize the $L^2$ cost function
$$ C = \sum_{u,i\in\text{observed ratings}} (r_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_{u} \|x_u\|^2 + \sum_{i} \|y_i\|^2 \right).$$The constant $\lambda$ is called the regularization parameter and essentially penalizes the components of the matrices $X$ and $Y$ if they get too large (in magnitude). This is important for numerical stability (and some kind of regularization is almost always used). It has an even more important effect in this setting, though:
Fundamental Observation: If we hold the item vectors $Y$ fixed, $C$ is a quadratic function of the components of $X$. Similarly, if we hold the user vectors $X$ fixed, $C$ is a quadratic function of the components of $Y$.
So in order to minimize $C$, one could try the following:
Some remarks are in order.
The above algorithm is called (for obvious reasons) alternating least squares.
A bit of linear algebra yields the following algorithm (see here, for example):
i
, let $r_i$ be the vector of ratings of that item (it will have n_users
components; one for each user). Compute
$$y_i = \left(X^T X + \lambda I\right)^{-1} X^T r_i$$
for each item i
. ($I$ is the $N\times N$ identity matrix.)u
, let $r_u$ be the vector of ratings of that user (it will have n_items
components; one for each item).
Compute
$$x_u = \left(Y^T Y + \lambda I\right)^{-1} Y^T r_u$$
for each user u
.Remark. It is almost always faster to solve the linear system $\left(X^T X + \lambda I\right)y_i = X^T r_i$ instead of inverting the matrix $\left(X^T X + \lambda I\right)$ as written in the algorithm (similarly for the user vectors).
As we mentioned above, it turns out that most recommender systems perform better if user and item biases are taken into account.
Suppose we have a ratings system that allows each user to rate each item on a scale of 1 to 5 stars. Suppose we have two users: Alice
, who rates items with an average of 4
stars, and Bob
, whose average rating is 1.5
stars. If Bob
rates some new item with 3
stars, it means something very different than if Alice
rates the same item with 3
stars (Bob
really liked the new item, Alice
didn't). The difference is what we call user bias.
For an explicit-ratings model (as discussed above, when we have explicit ratings for user-item pairs), one way to account for user bias is to model the actual user-item ratings as
$$r_{ui} \sim \beta_u + \hat{r}_{ui},$$where $\beta_u$ is the user bias of user u
and $\hat{r}_{ui}$ is the model of whatever is left over; for example, $\hat{r}_{ui} = x^T_uy_i$ if we use the matrix factorization model discussed above. We can account for item bias in a similar way:
here, $\gamma_i$ is the item bias of item i
.
Remark. If instead of matrix factorization we used a traditional collaborative filter, we would define $\hat{r}_{ui}$ via $k$-nearest neighbors using some similarity metric (Pearson is a standard choice), $\beta_u$ is the average rating of user u
over all items, and $\gamma_i$ is the average rating of item i
over all users.
User and item biases can be directly incorporated into the ALS algorithm. We model the user-item ratings matrix as
$$r_{ui} \sim \beta_u + \gamma_i + x^T_u y_i$$and minimize the cost function
$$ C^\text{biased} = \sum_{u,i\in\text{observed ratings}} (r_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_{u} \left( \|x_u\|^2 + \beta_u^2\right) + \sum_{i} \left(\|y_i\|^2 + \gamma_i^2\right) \right).$$Again, because of the regularization, we can hold the user variables fixed and solve for the minimum in the item variables. Then we can hold the item variables fixed and solve for the minimum in the user variables.
The biases $\beta_u$ and $\gamma_i$ appear in $C^\text{biased}$ without any coefficients, whereas all the other parameters appear at least once with some coefficient. So one approach would be to solve for the biases separately from the other parameters in each step.
It is easier, though, to leverage the work we've already done. We can rewrite our cost function at each step so that it looks like an unbiased model, and then just use the same formulas we found above for the unbiased ALS. The trick is to define new vectors that include the biases as components in the right way.
Let $\beta$ be the vector of user biases (with n_users
components) and $\gamma$ the vector of item biases (with n_items
components). Here is the algorithm:
i
, define three new vectors :
$$r^\beta_i := r_i - \beta$$
with components $r^\beta_{ui}:= r_{ui}-\beta_u$ (notice that both vectors have n_users
components),
$$\tilde{x}^T_u := (1, x^T_u),$$
and
$$\tilde{y}^T_i := (\gamma_i, y^T_i).$$
Then $C^\text{biased} = \sum \left(r^\beta_{ui} - \widetilde{x}_u^T\widetilde{y}_i \right)^2 + \lambda\left( \sum_u \|\widetilde{x}_u\|^2 + \sum_i \|\widetilde{y}_i\|^2\right)$. Hence, we find that the item bias and vector can be computed as
$$ \widetilde{y}_i := \left(\begin{matrix} \gamma_i \\ y_i\end{matrix}\right) = \left(\widetilde{X}^T \widetilde{X} + \lambda I\right)^{-1} \widetilde{X}^T r^\beta_i.$$
($I$ is now the $(N+1)\times(N+1)$ identity matrix, and $\tilde{X}$ and $\tilde{Y}$ are matrices whose columns are the vectors $\tilde{x}_u$ and $\tilde{y}_i$ as usual.)u
, define three new vectors:
$$r^\gamma_u := r_u - \gamma,$$
$$\widetilde{y}^T_i := (1, y_i),$$
and
$$\widetilde{x}^T_u := (\beta_u, x_u).$$
The user bias and vector can be computed as
$$ \widetilde{x}_u := \left(\begin{matrix} \beta_u \\ x_u\end{matrix}\right) = \left(\widetilde{Y}^T \widetilde{Y} + \lambda I\right)^{-1} \widetilde{Y}^T r^\gamma_u.$$For many real-world user-item interactions there are no explicit rating data available. However, there is often nontrivial information about the interactions, e.g. clicks, listens/watches, purchases, etc. Such indirect "ratings" information about user-item interactions is known as implicit feedback. Modeling implicit feedback is a difficult but important problem. There are several ways to use the ALS matrix factorization to approach such a model. We present here a standard solution, presented (without bias corrections) in Hu2008.
The basic approach is to forget about modeling the implicit feedback directly. Rather, we want to understand whether user u
has a preference or not for item i
using a simple boolean variable which we denote by $p_{ui}.$ The number of clicks, listens, views, etc, will be interpreted as our confidence in our model.
Following the fundamental idea of matrix factorization models, we try to find a user vector $x_u$ for each user u
and an item vector $y_i$ for each item i
so that
$$p_{ui} \sim x^T_u y_i.$$
It is important to note that we never actually observe $p_{ui}$! (This is a very different situation than the explicit feedack models, as discussed above, where $r_{ui}$ is the observer rating.)
Let's assume the our implicit feedback is a positive integer (number of clicks, number of views, number of listens, etc). That is, $$r_{ui} = \text{# of times user }\mathtt{u}\text{ interacted with item }\mathtt{i}.$$
How do we go about finding the vectors $x_u$ and $y_i$ given some implicit feedback $\{r_{ui}\}$? If a user has interacted with an item, we have reason to believe that $p_{ui}=1$. The more they have interacted with that item, the stronger our belief in their preference.
To define our model, set $$ p_{ui} = \begin{cases} 1 & \text{if }r_{ui}>0\\ 0 & \text{if }r_{ui}=0.\end{cases}$$ We try to minimize the following cost function: $$ C_\text{implicit} := \sum_{u,i\in\text{observed interactions}} c_{ui}\left(p_{ui} - x^T_u y_i \right)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right),$$ where $c_{ui}$ is our confidence in $p_{ui}.$ That is, the more a user has interacted with an item, the more we penalize our model for incorrectly predicting $p_{ui}.$
If a user has never interacted with an item, it is possible that $p_{ui}=1$ and the user has just never encountered the item. It is also possible they are actively avoiding the item. To deal with this ambiguity, it is common to define the confidence by $$c_{ui}:=1 + \alpha r_{ui},$$ where $\alpha$ is a parameter of the model that needs to be tuned to the dataset. There is some empirical evidence that setting it to the sparsity ratio (the ratio of nonzero entries to zero entries) works well; missing entries are often interpreted as slightly negative, and so one could interpret $\alpha$ as balancing positive and negative interactions. See Johnson2014, for example. Another possibility that works well (and makes the model more robust against power users -- those users who interact with items orders of magnitude more often than the average user), which is also mentioned in Hu2008, is to set $$c_{ui} = 1 + \alpha\log(1+r_{ui}/\epsilon),$$ where $\epsilon$ is yet another data-dependent parameter of the model.
The basic idea of the ALS algorithm for matrix factorization applies to minimizing $C_\text{implicit}$: if we hold the user (resp. item) vectors fixed, then $C_\text{implicit}$ depends quadratically on the item (resp. user) vectors. That means we can do the same thing: hold the user vectors fixed and solve for the minimum in the item variables, then hold the item vectors fixed and solve for the minimum in the user variables, and repeat...
Of course, our cost function $C_\text{implicit}$ is slightly different, so the actual steps involving solving for the minimum look a bit different. The algorithm is as follows.
i
, let $p_i$ be the vector whose components are $p_{ui}$ (for fixed i
, there will be n_users
components), let $C^i$ be the diagonal matrix with $c_{ui}$ for fixed $u$ along the diagonal, and let $d_i = C^i p_i$ (the reason for defining $d_i$ separately will be explained in the remarks about the numpy
implementation below). Compute
$$y_i = \left(X^TC^iX+\lambda I\right)^{-1}X^Td_i.$$u
, let $p_u$ be the vector whose components are $p_{ui}$ (for fixed u
; there will be n_items
components), let $c_{u}$ be the vector whose components are $c_{ui}$, let $C^u$ be the diagonal matrix with $c_u$ along the diagonal, and let $d_u=C^u p_u$. Compute
$$x_u = \left(Y^TC^uY + \lambda I\right)^{-1}Y^Td_u.$$Remarks
i
, and can be computed once and reused at each step of the algorithm (similarly for $Y^TY+\lambda I$ in step 3).It turns out that the matrix multiplication $X^TC^iX$ and the matrix-vector product $X^Td_i$ are both easier to implement in numpy
than to express in standard linear algebra notation. A direct implementation of the algorithm is of course possible, but a small bit of thought improves performance vastly.
numpy
Implementation NotesA direct implementation of the matrix product $X^TC^iX$ would be something like
Ci = np.diag(c[i])
XTCiX = np.dot(X, np.dot(Ci, np.dot(X.T)).
But the matrix product np.dot(X, Ci)
is just multiplying the rows of $X$, in order, by the diagonal elements of $C^i$, which is achieved very efficiently by simple numpy
broadcasting. So we can achieve the same matrix product via
XTCiX = np.dot(c[i] * X.T, X).
Moreover, since $d_i = p_i + \alpha r_i$ (which follows from $p_{ui} r_{ui} = r_{ui}$), we can rewrite the algorithm in terms of just the scaled raw counts $\alpha R$ and the preference $p$ as $$y_i = \left(X^TX+\lambda I + \alpha X^TR^iX\right)^{-1}X^T(p_i + \alpha r_i).$$
Hence, we can completely avoid constructing the diagonal matrices $C^i$ and $C^u$.
Finally, when n_users
and n_items
are large, $R$ (and hence $p$) will generally be sparse matrices. The matrix $\left(X^TX+\lambda I + \alpha X^TR^iX\right)$ is, however, $N\times N$ and since the number of factors is generally small (on the order of 10), there is no great savings by using a sparse implementation of this matrix. On the other hand, only the user vectors corresponding to nonzero entries of $r_i$ (and $p_i$) are needed in the computation of $X^TR^iX$ and of $X^T(p_i + \alpha r_i)$, so if the algorithm is parallelized, it is very useful to only ship those vectors to the workers doing the computation for $y_i$ (this is, for example, how the Spark implementation works). Our implementation is not parallelized (for the sake of clarity), so we don't need to worry about shipping data to workers.
As with the standard matrix factorization model, matrix factorization models for implicit feedback tend to work better if user and item biases are accounted for; we introduce parameters $\beta_u$ and $\gamma_i$ as before, and try to fit a model $$p_{ui} \sim \beta_u + \gamma_i + \hat{p}_{ui},$$ where $\hat{p}_{ui} = x^T_u y_i$ as above, by minimizing the cost function $$C_{implicit}^{biased} := \sum_{u,i\in\text{observed interactions}} c_{ui}\left(p_{ui} - \beta_u - \gamma_i - x^T_u y_i \right)^2 + \lambda \left(\sum_u (\|x_u\|^2 + \beta_u^2) + \sum_i (\|y_i\|^2 + \gamma_i^2)\right).$$
The interpretation of the biases in this case is slightly different:
Again, the principle of the ALS algorithm applies with appropriate modifications for the new cost function $C^\text{biased}_\text{implicit}$. The algorithm is:
The same remarks apply as for biased ALS and implicit ALS.
In a future post we will compare the matrix factorization algorithms we discussed here on real-world data and see that, indeed, accounting for user and item biases is worth the extra effort.
I would like to thank Josh Hemann and Spencer Stirling for carefully reading drafts of this post and making many useful suggestions. I would also like to thank Sergiy Matusevych for finding some important typos and making hepful comments.
D. Goldberg, D. Nichols, B. M. Oki, D. Terry, Using Collaborative Filtering to Weave an Information Tapestry, Comm. ACM 35 (1992) pp 61 -- 70.
R. M. Bell, Y. Koren, C. Volinsky, The BellKor 2008 Solution to the Netflix Prize
Y. Hu, Y. Koren, C. Volinsky Collaborative Filtering for Implicit Feedback Datasets 8th IEEE International Conference on Data Mining, pp. 263 -- 272.
C. Johnson Logistic Matrix Factorization for Implicit Feedback Data